Friday, February 16, 2018

Fine Print

  • I am not an expert in any of the software covered in this document, although I use it frequently.
  • If you mess up your files while using R, the RStudio command prompt, git, etc., or crash/burn/blow up your computer, please be aware that I do not accept any responsibility.
  • Please use the information contained here at your own risk!

Important Notes

  • The entire R mini BootCamp workshop is version controlled by Git.
  • All the files of this workshop are hosted on Bitbucket in a remote public git repository which can be found here.
    • To download the remote git public repository, please click here.
    • For the detailed software installation, please see Installation_and_Software_Notes.html file.
    • Please feel free to use and share any of the content; you do not need the permission of the repository owner.
  • This document is prepared with RStudio using R and R Markdown.
  • All comments, suggestions, and other correspondence should be sent to Omer Kara.

Introduction

  • This presentation intends to introduce you to the basic information, concepts, and tools of data analysis and linear regression in R, such as
    • information about R and related software
    • R basics, objects, and packages
    • importing, subsetting, and manipulating local and remote data
    • descriptive statistics and exploratory data analysis
    • linear regression
  • At the end of this workshop, you will
    • have a basic knowledge of R programming
    • know how to import/download/scrape, load, transform, and tidy data
    • know how to perform statistical analysis and create powerful graphics
    • know how to conduct basic exploratory data and regression analyses

Necessary Packages

  • First of all, let's install and load some of the necessary R packages we will use in this presentation.
  • The following example shows how to install and load an R package.
install.packages("rmarkdown") ## Installing a package (needed only once).
library("rmarkdown") ## Loading a package (needed in every new session).
  • The following packages were used in the development of this document: devtools, rmarkdown, knitr, checkpoint, rvest, pastecs, psych, magrittr, ggplot2, plotly, gapminder, stargazer, leaflet, DT, gvlma.
    • These packages are pre-installed and loaded for you as long as the R interactive session is started from the R mini BootCamp.Rproj file, the main file of this workshop.
  • Also, the development version of the ggplot2 package is installed from GitHub.
devtools::install_github('hadley/ggplot2')
library("ggplot2")

R and RStudio

What is R?

  • R is a sequentially interpreted object-oriented programming language for statistical computing, data mining, web scraping, graphics, and more.
    • Sequential interpretation means that R evaluates commands one at a time, in the order they are given; by default, it does not run two procedures at the same time.
    • In R, you can perform simple calculations, vector and matrix operations, and data manipulation, create your own functions and procedures, and do almost anything you want with data in an easy and ordered way.
  • R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand in 1993.
    • R is named partly after the first names of its two original authors and partly as a play on the name of the S programming language.
  • R is currently developed by the R Core Team and supported by The R Foundation.

Why Use R?

  • Some advantages of using R are as follows
    • R matches or exceeds most of the features of the currently available statistical packages.
    • R is open-source and completely free.
    • It gives you great flexibility and freedom in how you code. You can write your code from scratch or use built-in functions from R packages.
    • In R, you can write code and save it for replication, debugging, and modification.
    • The syntax is very similar across methods.
  • Some disadvantages of R are as follows
    • R has a very steep learning curve.
    • Its default graphical user interface (GUI) is not considered user-friendly. For this reason, we will use RStudio, an integrated development environment (IDE) for R.
    • R compels you to type in commands for every task.

RStudio

  • RStudio is the best IDE available for R and it makes R easier to use.
  • It is open-source and free.
  • It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for data viewing, plotting, history checking, debugging, workspace management, R packages, and git.
  • The features of RStudio that I like most are
    • syntax highlighting, code completion, and smart indentation
    • quickly jumping to function definitions
    • integrated R help and documentation
    • R Project feature for easily managing multiple working directories
    • workspace browser and data viewer
    • authoring with Sweave and R Markdown
  • Note that we will use R as our main programming software, but to make coding easier and more fun we will use RStudio as our IDE, which runs R in the background.
  • If you want to see my RStudio preferences in detail, please see the RStudio Preferences section in the Installation_and_Software_Notes.html file.

Downloading and Installing

Getting Help

R Basics

Console

  • After starting RStudio, the console shows some basic information about the program’s version, license, and citation.
  • Use the console only for short pieces of code, for quickly checking results, and for reading error and warning messages.
  • To start, let's use it as a calculator
> 10 + 5
#> [1] 15
> (2 * ((10 + 5 - 3) * 2 - 9)) / 2^1
#> [1] 15
  • After each command, you need to hit the Enter key to get a result.
  • In this document, the printed results appear right under the command and start with the #> symbol.
  • Please note the [1] at the beginning of each result. It is the index of the first element printed on that line, so a result that spans several lines shows a different index at the start of each line.

Editor

  • As an alternative, R commands can be run from the editor, which allows you to save your code, modify it, and run it whenever necessary.
  • Let's see how we can use the editor as a calculator.
15 + 5
#> [1] 20

((10 - 2)/4)^2
#> [1] 4

log(10) ## Takes the natural logarithm of the input.
#> [1] 2.302585

log(exp(5)) ## Note the exponential function written as "exp".
#> [1] 5

sin(pi/6) ## Sine function and π.
#> [1] 0.5

  • In R, the # sign is used to comment out code or to add explanatory text.
  • Everything after the # sign on a line is ignored by R and treated as text.
  • The convention used here is # for a full comment line and ## for a comment at the end of a code line.
# This is a comment line which can be very long if you wish.

3 * 3 ## This is a code line with a comment at the end.
#> [1] 9

# This is a comment line for a section.

## This is a comment line for a subsection.

R Objects

Creating Objects

  • R treats everything as an object.
  • Numbers, strings, vectors, matrices, data frames, plots, functions, etc. are all R objects.
  • The best way to create any R object is by using the assignment operator <-.
x <- 1:10 ## The ":" operator generates regular sequences.
print(x) ## Prints the object. I rarely use this function.
#>  [1]  1  2  3  4  5  6  7  8  9 10

y <- 6
y ## Also prints the object.
#> [1] 6

z <- "R mini BootCamp" ## A character string. Strings in R are enclosed in double (or single) quotes.
z
#> [1] "R mini BootCamp"

  • R is case sensitive.
  • You can overwrite an R object by creating another R object with the same name.
X <- "NCSU" ## With capital letter name.
X
#> [1] "NCSU"

x <- 6 ## With lower case name.
x
#> [1] 6

x <- 10:15 ## Overwriting it with different values.
x
#> [1] 10 11 12 13 14 15

Vectors

  • Vectors are one of the most used objects in R.
  • They consist of one or more values of the same type.
x <- c("NCSU", "Wolfpack") ## Character vector.
is.numeric(x) ## Checks whether the object is numeric.
#> [1] FALSE

y <- c(5, 6, 7, 8, 9) ## c() is the concatenate (combine) function; here it creates a numeric vector.
class(y) ## Gives the class of an object.
#> [1] "numeric"

str(y) ## Gives the details of object structure (class of the object and its values).
#>  num [1:5] 5 6 7 8 9

z <- c(TRUE, FALSE) ## Logical vector.
length(z) ## Gives the length of a vector.
#> [1] 2

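  • Note that all elements of a vector must be of the same type; if you mix types, R silently coerces them to the most flexible one (logical, then numeric, then character).
a <- c(5, 6, 7, 8, "d") ## Vector with numeric and character values.
str(a) ## Note the class of vector a.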
#>  chr [1:5] "5" "6" "7" "8" "d"

b <- c("a", TRUE) ## Vector with logical and character values.
str(b)
#>  chr [1:2] "a" "TRUE"

x <- c(TRUE, 2) ## Vector with logical and numeric values.
str(x) ## Numeric (TRUE will be converted into number 1).
#>  num [1:2] 1 2

str(c(FALSE, 2)) ## Numeric (FALSE will be converted into number 0).
#>  num [1:2] 0 2

  • You can create a new vector by combining two or more vectors.
a <- c(5:15)
b <- c(10:20)
c(b, a) ## Combining two vectors.
#>  [1] 10 11 12 13 14 15 16 17 18 19 20  5  6  7  8  9 10 11 12 13 14 15
  • Vectors make it easy to apply an operation to all elements at once (vectorized operations).
a <- c(1:3)
2 + a
#> [1] 3 4 5

c(1:3) / c(1:3) ## Vectors with the same length.
#> [1] 1 1 1

c(1:3) / c(1:4) ## Vectors with different lengths: the shorter one is recycled, and R gives a warning here.
#> [1] 1.00 1.00 1.00 0.25
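  • The recycling in the last example can be made explicit. The following is only a small sketch: it repeats the shorter vector by hand with rep_len and checks that the result matches what R does automatically. The object names (short, long, recycled, manual, automatic) are made up for this illustration.

```r
short <- c(1:3)
long <- c(1:4)

recycled <- rep_len(short, length.out = length(long)) ## 1 2 3 1: the shorter vector starts over.
manual <- recycled / long ## Division with two equal-length vectors.

automatic <- suppressWarnings(short / long) ## R recycles on its own; the warning is silenced here.

identical(manual, automatic)
#> [1] TRUE
```

  • When the longer length is an exact multiple of the shorter one, R recycles without any warning, which is why 2 + a works silently.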

  • There are some operations that are specific to vectors.
a <- rnorm(n = 10000, mean = 0, sd = 1) ## Random number generator for the normal distribution.

head(x = a, n = 5) ## Prints the first 5 elements of a vector.
tail(x = a, n = 5) ## Prints the last 5 elements of a vector.

min(a) ## Minimum value.
max(a) ## Maximum value.
sum(a) ## Total of all elements in a vector.

mean(a) ## Mean.
var(a) ## Variance.
sd(a) ## Standard deviation.

sort(a, decreasing = FALSE, na.last = TRUE) ## Sorts the values of a vector in increasing order.
unique(a) ## Gives you the unique values in a vector.
sort(unique(a)) ## Unique values are sorted.

Factors

  • Factors are used to represent categorical data.
    • They are one-dimensional data structures, like vectors.
    • One can think of a factor as an integer vector where each integer has a label (level).
a <- c("yes", "yes", "no", "yes", "no") ## A character vector.
b <- factor(x = a) ## Creates the factor and gives you the "Levels".
b
#> [1] yes yes no  yes no 
#> Levels: no yes

is.factor(b) ## Checks whether the object is a factor.
#> [1] TRUE

str(b) ## Note that the levels are automatically put in alphabetical order.
#>  Factor w/ 2 levels "no","yes": 2 2 1 2 1
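  • To see the "integer vector with labels" idea directly, the sketch below looks at the integer codes behind a factor and sets the level order by hand instead of relying on alphabetical order. The object names (answers, answers.factor) are chosen only for this illustration.

```r
answers <- c("yes", "yes", "no", "yes", "no")
answers.factor <- factor(x = answers, levels = c("yes", "no")) ## "yes" is now the first level.

levels(answers.factor) ## The labels, in the order we chose.
#> [1] "yes" "no"

as.integer(answers.factor) ## The underlying integer codes.
#> [1] 1 1 2 1 2

table(answers.factor) ## Number of observations in each level.
```

  • The order of the levels matters later on, for example in plots and regressions, where the first level is usually treated as the reference category.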

Logicals

  • Another key element in the R language is logical objects.
    • Logical objects in R are one-dimensional data structures, like vectors.
    • They can take only two values, TRUE and FALSE (or NA when the value is missing).
a <- TRUE
a <- c(TRUE, FALSE, FALSE)
a
#> [1]  TRUE FALSE FALSE

is.logical(a) ## Checks whether the vector is logical.
#> [1] TRUE

str(a)
#>  logi [1:3] TRUE FALSE FALSE

x <- "TRUE"
str(x)
#>  chr "TRUE"

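  • Generally, logical objects are created when elements are compared using relational and logical operators.
  • The operators available in R include equal (==), not equal (!=), and (&, &&), and or (|, ||).
    • & and | work elementwise on vectors, whereas && and || are meant for a single TRUE/FALSE condition, for example inside an if statement.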
36 == 36 ## Checks equality of two numeric objects.
#> [1] TRUE

"NCSU" != "ncsu" ## Checks inequality.
#> [1] TRUE

1 < 0 ## Less than.
#> [1] FALSE

-2:2 >= 0 ## Elementwise evaluation.
#> [1] FALSE FALSE  TRUE  TRUE  TRUE

(-2:2 >= 0) & (-2:2 <= 0) ## Elementwise evaluation. Note that the two vectors have the same length.
#> [1] FALSE FALSE  TRUE FALSE FALSE

1:6 %in% 3:10 ## Checks whether each element of the left object has a match in the right object.
#> [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
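  • Logical vectors are most useful when they are put to work. As a minimal sketch (the name is.positive is made up for this example), a comparison result can be counted, averaged, and used to pick elements.

```r
x <- -2:2
is.positive <- x > 0 ## Logical vector: FALSE FALSE FALSE TRUE TRUE.

sum(is.positive) ## TRUE counts as 1, so this is the number of positive values.
#> [1] 2

mean(is.positive) ## Share of positive values.
#> [1] 0.4

x[is.positive] ## Keeps only the elements where the condition is TRUE.
#> [1] 1 2

which(is.positive) ## Positions of the TRUE values.
#> [1] 4 5
```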

Matrices

  • Unlike vectors, matrices have two dimensions: rows and columns.
matrix(data = 1:6, nrow = 2, ncol = 3, byrow = FALSE) ## Default is to fill the matrix by column.
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

x <- matrix(data = 1:4, nrow = 2, byrow = TRUE, dimnames = list(c("row1", "row2"), c("col1", "col2")))
x
#>      col1 col2
#> row1    1    2
#> row2    3    4

is.matrix(x) ## Checks whether the object is a matrix.
#> [1] TRUE

str(x)
#>  int [1:2, 1:2] 1 3 2 4
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:2] "row1" "row2"
#>   ..$ : chr [1:2] "col1" "col2"

  • Note that a matrix with dimension names has characteristics very similar to those of an R data frame.
  • So, most of the functions below can also be used for data frames.
t(x) ## Transpose.
#>      row1 row2
#> col1    1    3
#> col2    2    4

nrow(x) ## Number of rows.
#> [1] 2
colnames(x) ## Gives the column names.
#> [1] "col1" "col2"

rowSums(x) ## Row sums.
#> row1 row2 
#>    3    7
colMeans(x) ## Column means.
#> col1 col2 
#>    2    3

  • You can combine matrices using the rbind and cbind functions.
x <- matrix(data = 1:3, nrow = 1, ncol = 3) ## One row with the values 1, 2, 3.
y <- matrix(data = 10:12, nrow = 1, ncol = 3) ## One row with the values 10, 11, 12.

rbind(x, y)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]   10   11   12

rbind(x, 1, 0) ## The single values 1 and 0 are recycled across the columns.
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    1    1    1
#> [3,]    0    0    0

cbind(x, y)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    2    3   10   11   12

Data Frames

  • Data frames are the most common way of storing data in R.
    • They are not matrices but they share many characteristics of matrices.
    • Actually, data frames are lists with equal-length vectors and some additional structure.
  • An empty data frame can be created by using the data.frame function.
my.data <- data.frame() ## Creates an empty data frame.

is.data.frame(my.data) ## Checks whether the object is a data frame.
#> [1] TRUE

str(my.data)
#> 'data.frame':    0 obs. of  0 variables

  • You can create a non-empty data frame by supplying some vectors as input in the data.frame function.
sample.size <- 30 ## Defining the sample size for later use.
column.1 <- round(rnorm(n = sample.size, mean = 5, sd = 1), digits = 2)
column.2 <- sample(x = c(-50:50, NA), size = sample.size, replace = TRUE, prob = NULL)
column.3 <- sample(x = c("NCSU", "CALS", "Economics"), size = sample.size, replace = TRUE, prob = NULL)
column.4 <- factor(sample(x = c("Yes", "No"), size = sample.size, replace = TRUE, prob = NULL))
my.data <- data.frame(Column.1 = column.1, Column.2 = column.2, Column.3 = column.3, Column.4 = column.4)
my.data
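  • Because rnorm and sample draw random values, the data frame above changes every time the code is run. A common way to make such results reproducible is to fix the random seed first. The sketch below assumes nothing beyond base R; the seed value 2018 and the object names are arbitrary choices for this illustration.

```r
set.seed(2018) ## Any integer works; 2018 is an arbitrary choice.
first.draw <- round(rnorm(n = 5, mean = 5, sd = 1), digits = 2)

set.seed(2018) ## Resetting the seed restarts the random number stream.
second.draw <- round(rnorm(n = 5, mean = 5, sd = 1), digits = 2)

identical(first.draw, second.draw)
#> [1] TRUE
```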

  • Note that a data frame has characteristics very similar to those of a matrix with dimension names.
  • So, most of the functions below can also be used for matrices.
data.1 <- data.frame(c(1:5), c(6:10), c(11:15), c(16:20), stringsAsFactors = FALSE)

column.names <- paste("Column", ".", 1:ncol(data.1), sep = "")
colnames(data.1) <- column.names ## Assigns the column names to the data frame.

head(x = data.1, n = 2) ## Prints the first 2 rows of a data frame.
# tail(x = data.1, n = 2) ## Prints the last 2 rows of a data frame.

Lists

  • A list is a generic vector containing different types of objects.
  • Lists are similar to vectors, except that each entry can be any R object or even another list.
a <- list(1, "a", TRUE)

a
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] "a"
#> 
#> [[3]]
#> [1] TRUE

str(a)
#> List of 3
#>  $ : num 1
#>  $ : chr "a"
#>  $ : logi TRUE

  • You can add names to the elements of a list.
a <- list(Numeric = 1, Character = "a", Logical = TRUE, Complex = 1 + 4i)

a
#> $Numeric
#> [1] 1
#> 
#> $Character
#> [1] "a"
#> 
#> $Logical
#> [1] TRUE
#> 
#> $Complex
#> [1] 1+4i

is.list(a) ## Checks whether the object is a list.
#> [1] TRUE
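  • Once the elements of a list have names, they can be pulled out in several ways. The following is a short sketch of the usual options; note the difference between single and double brackets.

```r
a <- list(Numeric = 1, Character = "a", Logical = TRUE)

a$Numeric ## Extracts the element by name.
#> [1] 1

a[["Character"]] ## Same idea with double brackets; a position such as a[[2]] also works.
#> [1] "a"

class(a["Logical"]) ## Single brackets return a (shorter) list.
#> [1] "list"

class(a[["Logical"]]) ## Double brackets return the element itself.
#> [1] "logical"
```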

Working with RStudio

Working Directory

  • R is always pointed at a directory in your computer's file system.
    • This directory is called the working directory.
    • When reading and writing data sets or any other R objects, R uses the working directory as the base path for file operations.
# R code chunk is not evaluated.

R.home() ## Gives you the home directory of R software itself.

getwd() ## Gives the current working directory.
my.current.dir <- getwd() ## Assigns the current working directory to an object.

setwd("Path of Working Directory") ## Sets the working directory to a new one.

setwd("~") ## Changes the working directory to home directory.
setwd("../") ## Double dots are used for moving up in the folder hierarchy.
setwd("./") ## A single dot represents the current directory itself.
setwd("/") ## Forward slash changes the working directory to the root.
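  • If you change the working directory inside a script, it is good practice to change it back afterwards. The sketch below uses R's temporary folder (tempdir) only as a stand-in for a real project folder, and file.path to build a path without typing the separators by hand.

```r
old.dir <- getwd() ## Remember where we started.

setwd(tempdir()) ## Move to R's temporary folder for this session.
## ... read or write files here ...
setwd(old.dir) ## Go back to the original working directory.

file.path("R", "Repo", "Data", "wage1.xls") ## Joins the pieces into one path.
#> [1] "R/Repo/Data/wage1.xls"
```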

Getting Help

  • To get help for the functions and data sets in R, use help or ?.
  • The related help information will be shown in the "Help" tab in RStudio.
  • Note that help and ? look only in the packages that are currently loaded, so the package that contains the function or data set should already be installed and loaded.
# R code chunk is not evaluated.

help(lm) ## Opens the help page for "lm" function which is for fitting linear models.
?lm
?"lm"

??lm ## Searches the help pages of all installed packages for "lm".
??errorsarlm ## If the package that contains the function is installed but not loaded, use "??".

?":" ## Help for operator.
?"%in%"

Packages

  • To use an R package, you first need to install it, which needs to be done just once.
  • Then, load the package, which needs to be done every time you restart R or RStudio.
# R code chunk is not evaluated.

install.packages("tidyr") ## Installs single package.
library("tidyr") ## Loads single package.

install.packages(c("RColorBrewer", "stringr")) ## Installs multiple packages.
lapply(c("RColorBrewer", "stringr"), library, character.only = TRUE) ## Loads multiple packages.
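  • Since installation is needed only once, a script can check what is already installed before calling install.packages. The following is one possible sketch, not the setup used in this workshop; "stats" and "utils" are placeholder names that ship with R, so nothing is actually installed when you run it.

```r
needed <- c("stats", "utils") ## Placeholder names; replace with the packages you need.

## requireNamespace returns TRUE if the package is installed (it loads the namespace without attaching it).
is.installed <- sapply(needed, requireNamespace, quietly = TRUE)
missing.packages <- needed[!is.installed]

if (length(missing.packages) > 0) {
    install.packages(missing.packages) ## Runs only when something is missing.
}
```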

R Details

Missing Values

  • Missing values in R appear as NA.
    • It is an indicator of missingness.
    • NA is not a string or a numeric value.
  • We can create vectors, factors, logicals, matrices, arrays, data frames, and lists with missing values.
    • You can use the is.na function to check, element by element, whether values are missing.
    • You can also use the complete.cases function to check which elements (or rows) have no missing values.
x <- c(1:3, NA, 5:7, NA) ## Numeric vector.
is.na(x) ## Checks the missing values.
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

complete.cases(x) ## Checks the non-missing values.
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE

y <- c("a", "b", "c", NA, "NA") ## Character vector.
is.na(y) ## Note that "NA" is not missing; it is a character string with the value "NA".
#> [1] FALSE FALSE FALSE  TRUE FALSE
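  • Missing values also affect calculations. As a short sketch using the numeric vector from above, most summary functions return NA unless the missing values are removed first, for example with the na.rm argument.

```r
x <- c(1:3, NA, 5:7, NA)

mean(x) ## A calculation that involves NA returns NA.
#> [1] NA

mean(x, na.rm = TRUE) ## na.rm = TRUE drops the missing values first.
#> [1] 4

sum(is.na(x)) ## Number of missing values.
#> [1] 2

x[!is.na(x)] ## Keeps only the non-missing values.
#> [1] 1 2 3 5 6 7
```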

  • NA can arise when you load a data set with empty cells.
  • Note that data frames are very similar to matrices so the below code applies to data frames as well.
x <- data.frame(c(NA, 1:2, NA), c(NA, 4, NA, 5), c(7:8, NA, NA), c(10:13), stringsAsFactors = FALSE)
colnames(x) <- paste0("Column", ".", 1:ncol(x))
x
is.na(x)
#>      Column.1 Column.2 Column.3 Column.4
#> [1,]     TRUE     TRUE    FALSE    FALSE
#> [2,]    FALSE    FALSE    FALSE    FALSE
#> [3,]    FALSE     TRUE     TRUE    FALSE
#> [4,]     TRUE    FALSE     TRUE    FALSE

Subsetting

Vectors

  • Since vectors have only one dimension, you should use only one index.
  • In general for vectors, use [index].
  • You can also subset the previously subsetted vectors. For instance, use [index.1][index.2].
a <- c(1:10)
a[3] ## Selecting the 3rd element.
#> [1] 3

a[c(2, 3, 5)] ## Selecting the 2nd, 3rd and the 5th elements.
#> [1] 2 3 5

a[c(2, 3, 5)][c(1, 2)][1] ## Subsetting three times.
#> [1] 2

a[-1] ## A negative index drops the 1st element.
#> [1]  2  3  4  5  6  7  8  9 10

a[-c(1:4)] ## Drops the first four elements.
#> [1]  5  6  7  8  9 10
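  • Besides positions, a vector can also be subset with a logical condition or, if its elements are named, with names. The sketch below illustrates both; the named vector grades is made up for this example.

```r
a <- c(1:10)

a[a > 7] ## Elements greater than 7.
#> [1]  8  9 10

a[a %% 2 == 0] ## Even elements ("%%" gives the remainder of a division).
#> [1]  2  4  6  8 10

grades <- c(math = 90, art = 75, music = 82) ## Hypothetical named vector.
grades["art"] ## Selecting by name.
```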

Data Frames

  • Since data frames are two-dimensional R objects, you can use up to two indices for subsetting.
  • For subsetting columns in data frames, use [, index.2] and $.
x <- data.frame(c(1:5), c(6:10), c(11:15), c(16:20), stringsAsFactors = FALSE)
colnames(x) <- paste0("Column", ".", 1:ncol(x))

x[, 1] ## Column 1 values only.
#> [1] 1 2 3 4 5

x[, "Column.1"]
#> [1] 1 2 3 4 5

x$Column.1 ## Same as above. 
#> [1] 1 2 3 4 5

  • You can use [index.1, index.2] for subsetting rows and columns simultaneously.
    • Note that index.1 is for rows, and index.2 is for columns.
x[1, 1]
#> [1] 1

x[c(1, 3), c(2, 4)]
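  • In practice, rows are often selected with a condition rather than with index numbers. The sketch below rebuilds the same data frame with named columns and shows a condition in the row index, as well as the drop argument, which keeps a single column as a data frame. The names big.rows and one.column are made up for this illustration.

```r
x <- data.frame(Column.1 = 1:5, Column.2 = 6:10, Column.3 = 11:15, Column.4 = 16:20)

big.rows <- x[x$Column.1 > 3, ] ## Rows where Column.1 exceeds 3; all columns are kept.
nrow(big.rows)
#> [1] 2

x[x$Column.1 > 3, c("Column.2", "Column.4")] ## Condition for rows, names for columns.

one.column <- x[, 1, drop = FALSE] ## drop = FALSE keeps the result as a data frame.
is.data.frame(one.column)
#> [1] TRUE
```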

Coercion

  • In R, you can convert the class of an object into other classes by explicit coercion.
# R code chunk is not evaluated.

x <- c(0:6)
class(x) ## The class of x is integer.

as.numeric(x) ## Coerces x as a numeric.
as.character(x) ## Coerces x as a character.
as.complex(x) ## Coerces x as a complex.
as.factor(x) ## Coerces x as a factor.
as.logical(x) ## Coerces x as a logical (0 is FALSE and every nonzero value is TRUE).
as.matrix(x) ## Coerces x as a matrix.
as.array(x) ## Coerces x as an array.
as.data.frame(x) ## Coerces x as a data frame.
as.list(x) ## Coerces x as a list.
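  • Coercion does not always give what you expect. The sketch below shows two common cases: a value that cannot be converted becomes NA, and converting a factor directly to numeric returns its integer codes rather than its labels. The object names are made up for this illustration.

```r
converted <- suppressWarnings(as.numeric(c("1", "2", "three"))) ## "three" cannot be converted; R would normally warn.
converted
#> [1]  1  2 NA

f <- factor(c("10", "20", "10"))
as.numeric(f) ## Returns the integer codes of the levels, not the labels.
#> [1] 1 2 1

as.numeric(as.character(f)) ## Converts the labels themselves.
#> [1] 10 20 10
```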

Data

Downloading Data

  • Let's download the Shiller Index data as a csv file from this link.
file.name <- paste0(data.path, "shiller", ".csv") ## data.path is assumed to hold the path of the data folder, ending with a slash.
fileURL <- "https://goo.gl/3DHEjM"  ## Address.
download.file(fileURL, file.name, method = "auto")  ## Downloading.
data.shiller <- read.csv(file.name, header = TRUE, sep = ",", dec = ".",
    colClasses = c("Date", "numeric"), na.strings = "")  ## Importing.
datatable(data.shiller, options = list(searching = FALSE, pageLength = 4, lengthMenu = c(5, 
    10, 15, 20)))
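  • The read.csv arguments used above can be tried without an internet connection. The following sketch writes a tiny made-up csv file to R's temporary folder and imports it the same way. Here data.path is an assumption (a folder path ending with a slash), and the file name and its contents are invented for this illustration.

```r
data.path <- paste0(tempdir(), "/") ## Assumption: a folder path that ends with a slash.
file.name <- paste0(data.path, "example", ".csv")

## A tiny made-up csv file: the second row has an empty Value cell.
writeLines(c("Date,Value", "2018-01-01,100.5", "2018-02-01,", "2018-03-01,102.25"), file.name)

data.example <- read.csv(file.name, header = TRUE, sep = ",", dec = ".",
    colClasses = c("Date", "numeric"), na.strings = "")

str(data.example) ## The first column has class Date; the empty cell is imported as NA.
```

  • colClasses sets the class of each column in order, and na.strings tells R which text should be treated as a missing value.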

Scraping Data

  • Now, suppose you want to scrape all of the cryptocurrencies listed on CoinMarketCap.
market.cap.html <- read_html("http://coinmarketcap.com/all/views/all/") ## Address. read_html comes from the rvest package.
data.crypto <- market.cap.html %>%
    html_node("table.table") %>% ## Css of the table.
    html_table(header = TRUE, trim = TRUE, fill = TRUE, dec = ".")
datatable(data.crypto, options = list(searching = FALSE, pageLength = 4, lengthMenu = c(5, 10, 15, 20)))

Importing Data

Data and Metadata

  • In this presentation, we will use a cross-sectional data set on wages.
  • The data and the corresponding metadata are contained in the wage1.xls and wage1_metadata.txt files, respectively.
    • Before importing data, it is very important to open and perform a visual check on both the raw data and metadata files.
    • Now, go to the R/Repo/Data folder and open both files.

Necessary Package

  • Since the raw data file, wage1.xls, is an Excel file, we need to use an R package that is capable of importing Excel files.
  • In R, there are several packages for importing data from an Excel file, such as XLConnect, xlsx, and readxl.
  • In this presentation, we will use only the XLConnect package.
  • The first step in importing an Excel file is installing and loading the XLConnect package.
install.packages("XLConnect") ## Installing.
library("XLConnect") ## Loading.

Methods

  • The second step is importing the data from the Excel file.
  • You can import the data in two different ways.

Method 1

data <- readWorksheetFromFile(file = "file.path", ...)

Method 2

  • First, use the loadWorkbook function to load an Excel file as a workbook.
  • Second, use the readWorksheet function to read data from the workbook.
workbook <- loadWorkbook(filename = "file.path")
data <- readWorksheet(object = workbook, sheet = "Sheet Name or Index", ...)

  • Let's import the wage1.xls Excel file with both methods.
  • Note the sheet and header arguments in both methods.
    • The sheet argument specifies the name or index of the sheet you want to load from the Excel file.
    • The header argument specifies whether the first row should be used as column names. Here it is FALSE, so the columns get generic names.
  • You can also use other arguments to import a certain region of a sheet in the Excel file.
    • Typing ?readWorksheet in the console shows the other arguments that can be used.

Method 1

data <- readWorksheetFromFile(file = paste0(data.path, "wage1.xls"), sheet = "WAGE1", header = FALSE)

Method 2

workbook <- loadWorkbook(paste0(data.path, "wage1.xls"))
data <- readWorksheet(workbook, sheet = "WAGE1", header = FALSE)

Other File Formats

  • If the raw data file is not in Excel format, you can use the read.csv and read.delim functions.
data.csv <- read.csv(paste0(data.path, "wage1.csv"), header = FALSE, sep = ",",
    dec = ".", colClasses = "numeric", na.strings = "")  ## Comma separated values.

data.txt <- read.csv(paste0(data.path, "wage1.txt"), header = FALSE, sep = "",
    dec = ".", colClasses = "numeric", na.strings = "")  ## Text file separated by white space.

data.tab <- read.csv(paste0(data.path, "wage1_tab.txt"), header = FALSE, sep = "\t",
    dec = ".", colClasses = "numeric", na.strings = "")  ## Tab delimited text file.

data.tab <- read.delim(paste0(data.path, "wage1_tab.txt"), header = FALSE, sep = "\t",
    dec = ".", colClasses = "numeric", na.strings = "")  ## Same as above.

all.equal(data, data.csv)
#> [1] "Names: 24 string mismatches"

all.equal(data, data.tab)
#> [1] "Names: 24 string mismatches"
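  • The all.equal results above report only name mismatches, which suggests that the values agree and only the column names differ between import methods. The sketch below reproduces that situation with two small made-up data frames and shows that the difference disappears once the names are aligned.

```r
a <- data.frame(Col1 = 1:3, Col2 = c(2.5, 3.5, 4.5))
b <- data.frame(V1 = 1:3, V2 = c(2.5, 3.5, 4.5)) ## Same values, different column names.

before <- all.equal(a, b) ## Returns a description of the differences instead of TRUE.
isTRUE(before)
#> [1] FALSE

colnames(b) <- colnames(a) ## Align the column names.
after <- all.equal(a, b)
isTRUE(after)
#> [1] TRUE
```

  • Because all.equal returns either TRUE or a character description, it is safer to wrap it in isTRUE when the result is used as a condition.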

Printing Data

  • Let's print the imported data.
  • As can be seen from the dynamic paged data table below,
    • the data has 526 rows (observations) and 24 columns (variables)
    • all variables are of class numeric, which is indicated by the <dbl> sign in the table
    • the column names are generic and need to be renamed for further use.

Subsetting Data

  • In this presentation we will use only a part of the imported data.
  • Thus, before manipulating the data, we will subset the raw data with specific variables only.
  • We will be particularly interested in the following variables
    • wage: average hourly earnings (Col1)
    • educ: years of education (Col2)
    • exper: years of potential experience (Col3)
    • tenure: years with current employer (Col4)
    • female: =1 if female (Col6)
data <- data[, c(1:4, 6)] ## Subsetting with column index numbers.
# data <- data[, c("Col1", "Col2", "Col3", "Col4", "Col6")] ## Subsetting with column names.

Manipulating Data

  • Since the column names are generic (i.e., Col1, Col2, …, Col24), we need to assign a name to each variable.
colnames(data) <- c("Wage", "Education", "Experience", "Tenure", "Gender") ## New names.
  • For further use, let's perform the following data manipulations.
    • convert the Gender variable to a factor in which Male is the base level.
    • create the Ln.Wage variable, the natural logarithm of Wage.
    • create the Experience.Sq variable, the square of Experience.
data$Gender <- factor(data$Gender, levels = c(0, 1), labels = c("Male", "Female")) ## "0" is Male, "1" is Female; labels follow the order of levels.
data$Ln.Wage <- log(data$Wage) ## Appended as the last column.
data$Experience.Sq <- data$Experience^2 ## Appended as the last column.
  • Finally, let's rearrange the column order.
col.order <- c("Wage", "Ln.Wage", "Education", "Experience", "Experience.Sq", "Tenure", "Gender")
data <- data[, col.order]
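  • When a 0/1 variable is recoded as a factor, it helps to check the mapping on a few rows, because factor assigns the labels in the order of the levels. The sketch below is only an illustration on three made-up rows (the toy data frame is not part of the workshop data); it spells out the levels so that 0 becomes Male and 1 becomes Female.

```r
toy <- data.frame(Wage = c(3.10, 6.00, 5.30), Experience = c(2, 44, 7), Gender = c(1, 0, 0)) ## Made-up rows.

toy$Gender <- factor(toy$Gender, levels = c(0, 1), labels = c("Male", "Female")) ## 0 -> Male, 1 -> Female.
toy$Ln.Wage <- log(toy$Wage) ## Natural logarithm of Wage.
toy$Experience.Sq <- toy$Experience^2 ## Square of Experience.

as.character(toy$Gender) ## The first row had Gender = 1, so it should read "Female".

levels(toy$Gender)[1] ## The first level is the base level.
#> [1] "Male"
```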

  • Now, let's print the subsetted and manipulated data, i.e., modified data.

Missing Values

  • Let's check the missing values in each column before the descriptive statistics of the modified data.
    • In R, missing values are represented with NA.
    • You can use the is.na function to check whether a value is missing.
sapply(data, function(missing) sum(is.na(missing)))  ## Number of missing values in each column.
#>          Wage       Ln.Wage     Education    Experience Experience.Sq 
#>             0             0             0             0             0 
#>        Tenure        Gender 
#>             0             0
sum(sapply(data, function(missing) sum(is.na(missing))))  ## Total number of missing values across all columns.
#> [1] 0
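  • The same counts can be obtained more compactly, because is.na applied to a data frame returns a logical matrix. The sketch below uses a small made-up data frame with missing values so that the counts are not all zero.

```r
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), stringsAsFactors = FALSE) ## Made-up data.

colSums(is.na(toy)) ## Number of missing values in each column.

sum(is.na(toy)) ## Total number of missing values.
#> [1] 2

any(is.na(toy)) ## TRUE if there is at least one missing value.
#> [1] TRUE

complete.rows <- toy[complete.cases(toy), ] ## Keeps only the rows without any NA.
nrow(complete.rows)
#> [1] 1
```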

Structure of Data

  • Let's check the structure of the modified data.
str(data)
#> 'data.frame':    526 obs. of  7 variables:
#>  $ Wage         : num  3.1 3.24 3 6 5.3 ...
#>  $ Ln.Wage      : num  1.13 1.18 1.1 1.79 1.67 ...
#>  $ Education    : num  11 12 11 8 12 16 18 12 12 17 ...
#>  $ Experience   : num  2 22 2 44 7 9 15 5 26 22 ...
#>  $ Experience.Sq: num  4 484 4 1936 49 ...
#>  $ Tenure       : num  0 2 0 28 2 8 7 3 4 21 ...
#>  $ Gender       : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 1 1 2 2 1 ...
  • You can also use dim, nrow, ncol, rownames, colnames, head, and tail functions to explore the structure of the data.
# R code chunk is not evaluated.

dim(data); nrow(data); ncol(data)
rownames(data); colnames(data)
head(data); tail(data)

Descriptive Statistics

  • The descriptive statistics of the manipulated data can be obtained with some built-in R functions.
  • Let's get the descriptive statistics for the Wage variable only.
# R code chunk is not evaluated.

min(data$Wage); max(data$Wage) 
mean(data$Wage); median(data$Wage)
var(data$Wage); sd(data$Wage)
range(data$Wage)
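skewness(data$Wage); kurtosis(data$Wage) ## These functions are provided by the moments and e1071 packages.
  • Loop functions such as sapply can be used to get descriptive statistics for all variables.
sapply(data, function(col) mean(col)) ## Gender is a factor, so its mean is NA (R also gives a warning).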
#>          Wage       Ln.Wage     Education    Experience Experience.Sq 
#>      5.896103      1.623268     12.562738     17.017110    473.435361 
#>        Tenure        Gender 
#>      5.104563            NA
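  • The NA for Gender appears because a mean cannot be computed for a factor. One way around this, sketched below on a small made-up data frame, is to pick the numeric columns first and apply the function only to them. The names toy, numeric.cols, and col.means are chosen for this illustration.

```r
toy <- data.frame(Wage = c(3.1, 6, 5.3), Education = c(11, 8, 12),
    Gender = factor(c("Female", "Male", "Male"))) ## Made-up rows.

numeric.cols <- sapply(toy, is.numeric) ## TRUE for numeric columns, FALSE for the factor.
col.means <- sapply(toy[, numeric.cols], mean) ## Means of the numeric columns only.
col.means
```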

base Package

summary(data)
#>       Wage           Ln.Wage          Education       Experience   
#>  Min.   : 0.530   Min.   :-0.6349   Min.   : 0.00   Min.   : 1.00  
#>  1st Qu.: 3.330   1st Qu.: 1.2030   1st Qu.:12.00   1st Qu.: 5.00  
#>  Median : 4.650   Median : 1.5369   Median :12.00   Median :13.50  
#>  Mean   : 5.896   Mean   : 1.6233   Mean   :12.56   Mean   :17.02  
#>  3rd Qu.: 6.880   3rd Qu.: 1.9286   3rd Qu.:14.00   3rd Qu.:26.00  
#>  Max.   :24.980   Max.   : 3.2181   Max.   :18.00   Max.   :51.00  
#>  Experience.Sq        Tenure          Gender   
#>  Min.   :   1.0   Min.   : 0.000   Male  :274  
#>  1st Qu.:  25.0   1st Qu.: 0.000   Female:252  
#>  Median : 182.5   Median : 2.000               
#>  Mean   : 473.4   Mean   : 5.105               
#>  3rd Qu.: 676.0   3rd Qu.: 7.000               
#>  Max.   :2601.0   Max.   :44.000

pastecs Package

knitr::kable(stat.desc(data)[-c(1:3, 7, 10:11, 14), ], align = "c", digits = 3)
         Wage    Ln.Wage  Education  Experience  Experience.Sq  Tenure  Gender
min      0.530   -0.635   0.000      1.000       1.000          0.000   NA
max      24.980  3.218    18.000     51.000      2601.000       44.000  NA
range    24.450  3.853    18.000     50.000      2600.000       44.000  NA
median   4.650   1.537    12.000     13.500      182.500        2.000   NA
mean     5.896   1.623    12.563     17.017      473.435        5.105   NA
var      13.639  0.283    7.667      184.204     379511.161     52.193  NA
std.dev  3.693   0.532    2.769      13.572      616.045        7.224   NA

psych Package

knitr::kable(as.data.frame(describe(data))[, -c(1, 6, 7)], align = "c", digits = 3)
               n    mean     sd       median   min     max       range     skew    kurtosis  se
Wage           526  5.896    3.693    4.650    0.530   24.980    24.450    2.002   4.940     0.161
Ln.Wage        526  1.623    0.532    1.537    -0.635  3.218     3.853     0.390   0.374     0.023
Education      526  12.563   2.769    12.000   0.000   18.000    18.000    -0.618  1.866     0.121
Experience     526  17.017   13.572   13.500   1.000   51.000    50.000    0.705   -0.652    0.592
Experience.Sq  526  473.435  616.045  182.500  1.000   2601.000  2600.000  1.488   1.280     26.861
Tenure         526  5.105    7.224    2.000    0.000   44.000    44.000    2.104   4.629     0.315
Gender*        526  1.479    0.500    1.000    1.000   2.000     1.000     0.083   -1.997    0.022

stargazer Package

stargazer(data, type = "text", flip = TRUE, mean.sd = TRUE, min.max = TRUE,
    iqr = TRUE, median = TRUE)  ## Use type = "html" for HTML output.
#> 
#> ==================================================================
#> Statistic  Wage  Ln.Wage Education Experience Experience.Sq Tenure
#> ------------------------------------------------------------------
#> N          526     526      526       526          526       526  
#> Mean      5.896   1.623   12.563     17.017      473.435    5.105 
#> St. Dev.  3.693   0.532    2.769     13.572      616.045    7.224 
#> Min       0.530  -0.635      0         1            1         0   
#> Pctl(25)  3.330   1.203     12         5           25         0   
#> Median    4.650   1.537     12        13.5        182.5       2   
#> Pctl(75)  6.880   1.929     14         26          676        7   
#> Max       24.980  3.218     18         51         2,601       44  
#> ------------------------------------------------------------------
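
The same headline statistics can also be computed with base R alone, which may help when none of the packages above is installed. The following is only a minimal sketch: the data frame `toy` is hypothetical (it is not the workshop data), and the helper `describe_column` is introduced purely for illustration.

```r
# Hypothetical toy data; the workshop's `data` object is not used here.
toy <- data.frame(Wage = c(3.1, 4.65, 5.9, 6.88, 24.98),
                  Education = c(8, 12, 12, 14, 18))

# Illustrative helper (not part of any package): one column -> named vector.
describe_column <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
}

# sapply() applies the helper to every column and binds the results
# into a matrix with one column per variable.
desc <- sapply(toy, describe_column)
print(round(t(desc), 3))  ## Transposed: one row per variable.
```

Because the helper returns a named vector, adding another statistic only requires one more named element inside `c()`.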

Exploratory Data Analysis

Histogram

base Package

hist(data$Wage)

Histogram

base Package

hist(data$Wage, col = "lightblue", breaks = 50, xlim = c(0, max(data$Wage)), 
    ylim = c(0, 80), xlab = "Wage", main = "Histogram of Wage")
rug(data$Wage)  ## Plots all of the points under the histogram.
abline(v = mean(data$Wage), col = "red", lwd = 1, lty = 2)  ## Vertical line.

Density Plot

base Package

hist(data$Wage, col = "red", breaks = 100, xlim = c(0, max(data$Wage)), ylim = c(0, 
    0.7), freq = FALSE, density = 10, xlab = "Wage", main = "Density of Wage")
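
In the call above, `freq = FALSE` rescales the bars to a density, while the `density` argument of hist only controls the shading lines of the bars. If a smooth density curve is wanted, one option is to overlay a kernel density estimate. A minimal sketch, assuming simulated wages (`wage_sim` is a made-up vector, not the workshop data):

```r
set.seed(1)
wage_sim <- rlnorm(500, meanlog = 1.6, sdlog = 0.5)  ## Simulated, right-skewed values.

hist(wage_sim, freq = FALSE, breaks = 50, col = "lightblue",
     xlab = "Wage (simulated)", main = "Histogram with Kernel Density")
kde <- density(wage_sim)          ## Gaussian kernel density estimate.
lines(kde, col = "red", lwd = 2)  ## Overlay on the density-scaled histogram.
```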

Bar Plot

base Package

barplot(table(data$Education), col = "wheat", xlab = "Education", ylab = "Number of Employees", 
    main = "Number of Employees by Education")

Bar Plot

base Package

barplot(table(data$Gender, data$Education), col = c("blue", "red"), xlab = "Education", 
    ylab = "Number of Employees", main = "Number of Employees by Gender and Education")
legend("topright", pch = 15, col = c("blue", "red"), legend = c("Male", "Female"))

Box Plot

base Package

boxplot(data$Wage ~ data$Gender, col = c("blue", "red"), names = c("Male", "Female"), 
    ylab = "Wage", main = "Box Plots of Wage for Male and Female")

Scatter Plot

base Package

plot(x = data$Education, y = data$Wage, axes = FALSE, col = "red", bg = "green", 
    cex = 0.75, pch = 21, xlab = "Education", ylab = "Wage", main = "Relationship Between Wage and Education")
axis(side = 1, at = c(0:max(data$Education)))
axis(side = 2)
box()

Scatter Plot Matrix

base Package

plot(data[, c("Wage", "Education", "Experience")], cex = 0.75)

Line Plot

base Package

  • A line plot needs time-series data. We will use AirPassengers, a built-in R time series (a ts object) of monthly airline passenger totals for 1949-1960.
plot(AirPassengers, type = "l", xlab = "Date", ylab = "Air Passengers")
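
Since AirPassengers carries its own time index, it can be inspected and subsetted without building a date column. A short sketch (the object name `ap_1960` is chosen here only for illustration):

```r
class(AirPassengers)      ## "ts"
start(AirPassengers)      ## 1949 1
end(AirPassengers)        ## 1960 12
frequency(AirPassengers)  ## 12 observations per year.

# window() extracts a sub-period while keeping the time index.
ap_1960 <- window(AirPassengers, start = c(1960, 1), end = c(1960, 12))
plot(ap_1960, type = "b", xlab = "Date", ylab = "Air Passengers")
```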

Interactive Plot

dygraphs Package

Interactive Plot

ggplot2 and plotly Packages

Interactive Thematic Map

plotly Package

Interactive Map

leaflet Package

Animation

ggplot2, plotly and gapminder Packages

Linear Regression

Simple Linear Regression

Model Estimation

  • In R, use the lm function to estimate linear regressions.
model <- lm(formula = data$Wage ~ data$Education, singular.ok = FALSE) ## Sometimes useful.
model <- lm(data = data, formula = Wage ~ Education, singular.ok = FALSE) ## Better for printing.
  • For quick access to the estimation results, use the attributes of the model object.
# R code chunk is not evaluated.

attributes(model) ## Gives the model attributes for quick model estimation results.
model$coefficients; coef(model) ## Coefficients.
confint(model) ## Confidence interval of coefficients.
model$residuals; residuals(model) ## Residuals.
model$df.residual ## Degrees of freedom for residuals.
model$fitted.values; fitted(model) ## Fitted values.
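
For a simple linear regression, the coefficients returned by lm can be checked by hand: the slope equals cov(x, y) / var(x) and the intercept equals mean(y) - slope * mean(x). The sketch below assumes the built-in cars data set rather than the workshop data, and the object name `fit` is used so that `model` is not overwritten.

```r
fit <- lm(dist ~ speed, data = cars)

# Closed-form OLS estimates for one regressor.
slope <- with(cars, cov(speed, dist) / var(speed))
intercept <- with(cars, mean(dist) - slope * mean(speed))

# Side-by-side comparison of lm() and the hand calculation.
rbind(lm = coef(fit), by_hand = c(intercept, slope))
```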

Simple Linear Regression

Model Summary

summary(model) ## Prints detailed model estimation results.
#> 
#> Call:
#> lm(formula = Wage ~ Education, data = data, singular.ok = FALSE)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.3396 -2.1501 -0.9674  1.1921 16.6085 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.90485    0.68497  -1.321    0.187    
#> Education    0.54136    0.05325  10.167   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.378 on 524 degrees of freedom
#> Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
#> F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16

Simple Linear Regression

Model Summary

  • To quickly access the estimation results, use the attributes of the model summary object.
# R code chunk is not evaluated.

model.sum <- summary(model)
str(model.sum) ## Structure of model summary object.
model.sum$coefficients ## Coefficients.
model.sum$r.squared ## R squared.
model.sum$adj.r.squared ## Adjusted R squared.
model.sum$sigma ## Sigma (residual standard error).
model.sum$cov.unscaled ## Unscaled covariance matrix, (X'X)^-1; vcov(model) gives the variance-covariance matrix.
model.sum$df ## Degrees of freedom.
model.sum$fstatistic ## F-statistic of the model.
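
The relation between these pieces can be verified directly: the variance-covariance matrix of the coefficients is sigma^2 times cov.unscaled, and the standard errors are the square roots of its diagonal. A minimal sketch, assuming the built-in cars data set (the names `fit`, `fit.sum`, `vc_manual`, and `se_manual` are illustrative):

```r
fit <- lm(dist ~ speed, data = cars)
fit.sum <- summary(fit)

# Variance-covariance matrix rebuilt from the summary components.
vc_manual <- fit.sum$sigma^2 * fit.sum$cov.unscaled
se_manual <- sqrt(diag(vc_manual))

all.equal(vc_manual, vcov(fit))                              ## Compare with vcov().
all.equal(se_manual, fit.sum$coefficients[, "Std. Error"])   ## Compare with reported SEs.
```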

Simple Linear Regression

Model Summary

stargazer(model, type = "text", style = "qje")  ## For a nice tabulated printing, use stargazer package.
#> 
#> ==========================================================
#>                                      Wage                 
#> ----------------------------------------------------------
#> Education                          0.541***               
#>                                    (0.053)                
#>                                                           
#> Constant                            -0.905                
#>                                    (0.685)                
#>                                                           
#> N                                    526                  
#> R2                                  0.165                 
#> Adjusted R2                         0.163                 
#> Residual Std. Error            3.378 (df = 524)           
#> F Statistic                103.363*** (df = 1; 524)       
#> ==========================================================
#> Notes:              ***Significant at the 1 percent level.
#>                      **Significant at the 5 percent level.
#>                      *Significant at the 10 percent level.

Simple Linear Regression

Estimated Regression Line

with(data, plot(Education, Wage, main = "Estimated Regression Line"))
abline(model, col = "red")

Simple Linear Regression

ANOVA

  • For the analysis of variance (ANOVA), you should use the anova function.
print.data.frame(anova(model))
#>            Df   Sum Sq    Mean Sq  F value       Pr(>F)
#> Education   1 1179.732 1179.73205 103.3627 2.782597e-22
#> Residuals 524 5980.682   11.41352       NA           NA
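
The ANOVA table ties back to the model summary: R squared is the model sum of squares divided by the total sum of squares, and the F value is the ratio of the two mean squares. A small worked check using the numbers printed above (the variable names are illustrative):

```r
ss_model <- 1179.732   ## Sum Sq, Education row.
ss_resid <- 5980.682   ## Sum Sq, Residuals row.
df_model <- 1
df_resid <- 524

r_squared <- ss_model / (ss_model + ss_resid)
f_value <- (ss_model / df_model) / (ss_resid / df_resid)

round(r_squared, 4)  ## 0.1648, matching summary(model).
round(f_value, 2)    ## 103.36
```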

Simple Linear Regression

Model Diagnosis

par(mfrow = c(2,2)) # Change the panel layout to 2 x 2.
plot(model)

par(mfrow = c(1,1)) # Change back to 1 x 1.

Simple Linear Regression

Model Diagnosis

  • Some important R functions for model diagnosis are listed below.
# R code chunk is not evaluated.

plot(gvlma(model))  ## Model diagnosis plots (from gvlma package).
summary(gvlma(model))  ## Checks all the linear regression assumptions (from gvlma package).
gvlma(model)  ## Similar to above but without model estimation results.

vif(model)  ## Multicollinearity (from car package); needs a model with at least two predictors.
outlierTest(model)  ## Outliers: Bonferroni p-value for most extreme observations (from car package).

qqPlot(model, main = "QQ Plot")  ## Normality of residuals: QQ plot for studentized residuals (from car package).
shapiro.test(residuals(model))  ## Normality check with Shapiro-Wilk test.
ad.test(residuals(model))  ## Normality check with Anderson-Darling test (from nortest package).

ncvTest(model)  ## Non-constant error variance test, i.e., heteroskedasticity (from car package).
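
The variance inflation factor only makes sense with two or more predictors, since it measures how well one predictor is explained by the others: VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the remaining predictors. A base R sketch of that calculation, assuming the built-in mtcars data set (not the workshop data) and illustrative object names:

```r
# mtcars is a built-in data set used only for illustration.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Auxiliary regression: one predictor on the remaining predictors.
aux <- lm(wt ~ hp + disp, data = mtcars)
vif_wt <- 1 / (1 - summary(aux)$r.squared)
vif_wt  ## Values well above 1 point to collinearity with the other predictors.
```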

Multiple Linear Regression

Model Estimation

  • You can run a multiple linear regression by adding further explanatory variables with the + sign in the formula argument of the lm function.
model <- lm(data = data, formula = Wage ~ Education + Experience + Tenure, singular.ok = FALSE)
stargazer(model, type = "html", style = "qje")  ## For a nice tabulated printing, use stargazer package.
                          Wage
Education                 0.599***
                          (0.051)
Experience                0.022*
                          (0.012)
Tenure                    0.169***
                          (0.022)
Constant                  -2.873***
                          (0.729)
N                         526
R2                        0.306
Adjusted R2               0.302
Residual Std. Error       3.084 (df = 522)
F Statistic               76.873*** (df = 3; 522)
Notes:                    ***Significant at the 1 percent level.
                          **Significant at the 5 percent level.
                          *Significant at the 10 percent level.

Multiple Linear Regression

Dummy Variables

  • You can add a dummy variable to a linear regression just like any other explanatory variable.
  • The only important point is to know the base level of the dummy variable, since its coefficient is measured relative to that level.
    • In our modified data frame, Male is the base level of the dummy variable Gender.
    • You can change the base level with the ref argument of the relevel function, e.g., data$Gender <- relevel(data$Gender, ref = "Female").
model.1 <- lm(data = data, formula = Wage ~ Education + Experience + Gender, 
    singular.ok = FALSE)  ## Base level: Male.

data$Gender <- relevel(data$Gender, ref = "Female")

model.2 <- lm(data = data, formula = Wage ~ Education + Experience + Gender, 
    singular.ok = FALSE)  ## Base level: Female.

data$Gender <- relevel(data$Gender, ref = "Male")  ## Revert the base level back to Male.
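
What releveling does to the estimates can be seen on a small built-in example: the fit is unchanged, the dummy coefficient flips sign, and the intercept absorbs the difference. This is only a sketch, assuming the built-in ToothGrowth data set (whose factor supp has levels OJ and VC) instead of the workshop data; `tg`, `fit.oj`, and `fit.vc` are illustrative names.

```r
tg <- ToothGrowth                  ## Work on a copy of the built-in data set.
levels(tg$supp)                    ## "OJ" "VC"; OJ is the base level.

fit.oj <- lm(len ~ dose + supp, data = tg)   ## Base level: OJ, reports suppVC.
tg$supp <- relevel(tg$supp, ref = "VC")
fit.vc <- lm(len ~ dose + supp, data = tg)   ## Base level: VC, reports suppOJ.

coef(fit.oj)["suppVC"]   ## Effect of VC relative to OJ.
coef(fit.vc)["suppOJ"]   ## Same magnitude, opposite sign.
```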

Multiple Linear Regression

Dummy Variables

stargazer(model.1, model.2, type = "html", style = "all2")
                                Dependent variable:
                                Wage
                                (1)                     (2)
Education                       0.603***                0.603***
                                (0.051)                 (0.051)
Experience                      0.064***                0.064***
                                (0.010)                 (0.010)
GenderMale                      -2.156***               -2.156***
                                (0.270)                 (0.270)
Constant                        -1.734**                -1.734**
                                (0.754)                 (0.754)
Observations                    526                     526
R2                              0.309                   0.309
Adjusted R2                     0.305                   0.305
Residual Std. Error (df = 522)  3.078                   3.078
F Statistic (df = 3; 522)       77.920*** (p = 0.000)   77.920*** (p = 0.000)
Note:                           *p<0.1; **p<0.05; ***p<0.01

Multiple Linear Regression

Squared Variables

  • Be careful with squared variables: inside a formula, ^ is a formula operator rather than arithmetic, so Experience^2 reduces to Experience. Wrap the power in I() or use a precomputed column such as Experience.Sq.
model.1 <- lm(data = data, formula = Wage ~ Experience + Experience^2) ## Not correct: same as Wage ~ Experience.
model.2 <- lm(data = data, formula = Wage ~ Experience + I(Experience^2)) ## Correct.
model.3 <- lm(data = data, formula = Wage ~ Experience + Experience.Sq) ## Correct.
                     Dependent variable:
                     Wage
                     (1)                                   (2)                                   (3)
Experience           0.031***                              0.298***                              0.298***
                     (0.012)                               (0.041)                               (0.041)
I(Experience^2)                                            -0.006***
                                                           (0.001)
Experience.Sq                                                                                    -0.006***
                                                                                                 (0.001)
Constant             5.373***                              3.725***                              3.725***
                     (0.257)                               (0.346)                               (0.346)
Observations         526                                   526                                   526
R2                   0.013                                 0.093                                 0.093
Adjusted R2          0.011                                 0.089                                 0.089
Residual Std. Error  3.673 (df = 524)                      3.524 (df = 523)                      3.524 (df = 523)
F Statistic          6.766*** (df = 1; 524) (p = 0.010)    26.740*** (df = 2; 523) (p = 0.000)   26.740*** (df = 2; 523) (p = 0.000)
Note:                *p<0.1; **p<0.05; ***p<0.01
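
A quick way to see the difference is to look at which coefficients each formula actually estimates. The sketch below assumes the built-in cars data set rather than the workshop data; `fit.wrong` and `fit.right` are illustrative names.

```r
fit.wrong <- lm(dist ~ speed + speed^2, data = cars)     ## Same as dist ~ speed.
fit.right <- lm(dist ~ speed + I(speed^2), data = cars)  ## True quadratic term.

names(coef(fit.wrong))  ## "(Intercept)" "speed"
names(coef(fit.right))  ## "(Intercept)" "speed" "I(speed^2)"
```

Checking names(coef(...)) after fitting is a cheap safeguard whenever a formula contains ^, * or : operators.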

Multiple Linear Regression

Model Comparison

  • Finally, let's run several multiple linear regressions with different specifications and compare them.
model.1 <- lm(data = data, formula = Ln.Wage ~ log(Experience), singular.ok = FALSE)

model.2 <- lm(data = data, formula = Wage ~ Education + Experience + Tenure, singular.ok = FALSE)

model.3 <- lm(data = data, formula = Ln.Wage ~ Education + Experience + Tenure, singular.ok = FALSE)

model.4 <- lm(data = data, formula = Wage ~ Education + Experience + Tenure + Gender, singular.ok = FALSE)

Multiple Linear Regression

Model Comparison

                      Dependent variable:
                      Ln.Wage                                 Wage                                    Ln.Wage                                 Wage
                      (1)                                     (2)                                     (3)                                     (4)
log(Experience)       0.117***
                      (0.021)
Education                                                     0.599***                                0.092***                                0.572***
                                                              (0.051)                                 (0.007)                                 (0.049)
Experience                                                    0.022*                                  0.004**                                 0.025**
                                                              (0.012)                                 (0.002)                                 (0.012)
Tenure                                                        0.169***                                0.022***                                0.141***
                                                              (0.022)                                 (0.003)                                 (0.021)
GenderFemale                                                                                                                                  1.811***
                                                                                                                                              (0.265)
Constant              1.343***                                -2.873***                               0.284***                                -3.379***
                      (0.056)                                 (0.729)                                 (0.104)                                 (0.703)
Observations          526                                     526                                     526                                     526
R2                    0.055                                   0.306                                   0.316                                   0.364
Adjusted R2           0.053                                   0.302                                   0.312                                   0.359
Residual Std. Error   0.517 (df = 524)                        3.084 (df = 522)                        0.441 (df = 522)                        2.958 (df = 521)
F Statistic           30.421*** (df = 1; 524) (p = 0.00000)   76.873*** (df = 3; 522) (p = 0.000)     80.391*** (df = 3; 522) (p = 0.000)     74.398*** (df = 4; 521) (p = 0.000)
Note:                 *p<0.1; **p<0.05; ***p<0.01
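
Fit statistics for several models can also be collected programmatically instead of being read off a table. The following is a sketch under stated assumptions: it uses the built-in mtcars data set, not the workshop data, and the list `fits` and data frame `comparison` are illustrative names. Note that R squared and the residual standard error are only comparable between models with the same dependent variable, so Wage and Ln.Wage models should not be ranked against each other this way.

```r
# Two nested models with the same dependent variable.
fits <- list(
  small = lm(mpg ~ wt, data = mtcars),
  large = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# One row per model, one column per fit statistic.
comparison <- data.frame(
  adj.r.squared = sapply(fits, function(m) summary(m)$adj.r.squared),
  sigma         = sapply(fits, function(m) summary(m)$sigma),
  df.residual   = sapply(fits, df.residual)
)
print(round(comparison, 3))

anova(fits$small, fits$large)  ## F test of the nested models.
```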

Thank You